ABSTRACT

The intrinsic scatter in the ellipticities of galaxies about the mean shape,
known as “shape noise,” is the most important source of noise in weak lensing
shear measurements. Several approaches to reducing shape noise have recently
been put forward, using information beyond photometry, such as radio
polarization and optical spectroscopy. Here we investigate how well the
intrinsic ellipticities of galaxies can be predicted using other, exclusively
photometric parameters. These parameters (such as galaxy colours) are already
available in the data and do not necessitate additional, often expensive
observations. We apply two regression techniques, generalized additive models
(GAM) and projection pursuit regression (PPR), to the publicly released data
catalog of galaxy properties from CFHTLenS. In our simple analysis we find that
the individual galaxy ellipticities can indeed be predicted from other
photometric parameters to better precision than the scatter about the mean
ellipticity. This means that without additional observations beyond photometry
the ellipticity contribution to the shear can be measured to higher precision,
comparable to using a larger sample of galaxies. Our best-fit model, achieved
using PPR, yields a gain equivalent to having 114.3% more galaxies. Using only
parameters unaffected by lensing (e.g. surface brightness, colour), the gain is
only ≈ 12%.

Key words: methods: data analysis - methods: statistical - surveys - galaxies:
statistics - galaxies: structure - cosmology: observations


1 INTRODUCTION 

Weak gravitational lensing of galaxies is the distortion of galaxy shapes and
sizes viewed behind a distribution of gravitating matter (see e.g. Bartelmann &
Schneider 2001 or Massey, Kitching & Richard 2010 for reviews). The change in
galaxy shapes, known as cosmic shear, has become one of the main probes of
cosmology due to its dependence on the total matter distribution and cosmic
geometry (e.g. Kaiser 1998). It is a driver for many ambitious upcoming
instruments, including LSST, Euclid and WFIRST (Spergel et al. 2015). On an
individual galaxy basis, cosmic shear induces changes in the ellipticity and
position angle at the few percent level. Determination of cosmic shear
therefore relies on measuring coherent distortions, averaging over large
numbers of galaxies. The dominant source of noise in this measurement is
so-called “shape noise,” which is due to the fact that the unlensed galaxies
have an intrinsic distribution of ellipticities and position angles. This
distribution must be averaged over to reveal the cosmic shear contribution. If
the shapes and position angles of the unlensed galaxies were known, then this
shape noise could be eliminated, and consequently many fewer galaxies would be
needed to achieve a given precision in cosmic shear.

This realization has given rise to several proposed techniques to determine the
unlensed shapes of individual galaxies, using additional observables beyond
photometry. The most prominent idea is to use spectroscopic information to do
this. Early work on this subject involved spatially resolved kinematic maps of
galaxies (Blain 2002, Morales 2006). Recently, Huff et al. (2013) have shown
that the disk galaxy line width-luminosity relationship (Tully & Fisher 1977)
can in principle be used to eliminate shape noise as an important source of
noise altogether. This is extremely promising, and these authors have shown
that spectroscopic lensing survey concepts can be conceived which are
significantly smaller in scale than LSST but which are highly competitive in
terms of predicted dark energy constraints. Other recent work by Brown & Battye
(2011) has shown how polarization angles measured from radio observations can
yield intrinsic galaxy position angles. Again, these techniques require the use
of additional information beyond galaxy photometry.

There is information in photometry itself, however, on the intrinsic shapes of
galaxies. For example, there are well-known relationships between the
inclination angles of galaxy disks and their surface brightnesses (e.g.
Giovanelli et al. 1994). One could imagine measuring the surface brightness
(which is unaffected by lensing) from images and then using this relationship
to infer something about the unlensed shape. In this paper, we will extend this
idea to all photometrically measurable information, and apply it to a published
observational dataset, the Canada-France-Hawaii Telescope Lensing Survey
(CFHTLenS; Heymans et al. 2012). The question we will try to answer is: is it
possible to reduce the shape noise in weak lensing shear without resorting to
extra observables beyond photometry (which are often expensive to obtain)?

There are many photometric variables which can be measured from galaxy images
(which often include colour information from different filters). Some of these
variables are affected by lensing (for example the size or the apparent
magnitude of galaxies) and others not (such as the surface brightness or the
photometric redshift). There will be correlations between these variables and
the intrinsic ellipticity of galaxies, and in this paper we will investigate
how these correlations can be used to predict the intrinsic ellipticity. We
will be using a set of 16 parameters for each galaxy taken from those measured
and published by the CFHTLenS team (see Section 3 for details). Because there
are a large number of parameters, we will not look at correlations of each
individually with galaxy ellipticity, but instead use two regression
techniques, generalized additive models (GAM) and projection pursuit regression
(PPR), to optimize ellipticity prediction in the multidimensional parameter
space.

The outline of this paper is as follows: in Section 2 we briefly outline how
galaxy ellipticities are defined and can be used to infer the shear due to weak
lensing. In Section 3 we introduce the data from CFHTLenS, and in Section 4 we
describe our method for predicting ellipticities from other photometric
parameters and present our results for how well the ellipticities can be
predicted, using observational data. We summarise and discuss our findings in
Section 5.


2 WEAK LENSING SHEAR AND ELLIPTICITY MEASUREMENTS

We note that the intrinsic ellipticities of galaxies are of course not
available in the CFHTLenS dataset, and so when making predictions for them, we
will compare the predictions to the actual measured ellipticities. Because the
effects of weak lensing shear on the ellipticities of galaxies are much smaller
(around the percent level) than the error on the predicted ellipticities, this
will be a good approximation to a comparison with the intrinsic ellipticities.

The weak gravitational lensing shear can be decomposed into the two usual
components, sometimes denoted as $\gamma_\times$ and $\gamma_+$, which distort
the position angle and ellipticity of the galaxy image. The ellipticity, $e$,
of a galaxy image is given by



where $q = b/a$, the ratio of minor to major axis. The distortion caused by the
shear means that the observed value of $q$ for a galaxy is given by

$$ q_{\rm obs} \simeq q_{\rm unlensed} (1 + 2\gamma_+), \tag{2} $$


with the position angle, $\theta$, changing as follows:

where $q_{\rm unlensed}$ and $\theta_{\rm unlensed}$ are galaxy parameters
before lensing distortion. It follows that an estimator for the shear component
$\gamma_+$ could be

which would necessitate knowledge of the unlensed galaxy shape,
$q_{\rm unlensed}$. In this paper, we will see whether the unlensed ellipticity
can be inferred from other parameters. We leave the determination of a shear
estimator that makes best use of this information to future work.
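
To illustrate why knowledge of the unlensed shape helps, the following minimal Python sketch inverts a first-order relation of the form q_obs ≈ q_unlensed (1 + k γ_+). The constant `k` (set to 2 here), the function name `estimate_gamma_plus` and the numerical values are assumptions made for this illustration only. It is not the shear estimator of this work, whose determination is left to future work.

```python
def estimate_gamma_plus(q_obs, q_unlensed, k=2.0):
    """First-order estimate of gamma_+ from observed and unlensed axis ratios.

    Assumes q_obs ~= q_unlensed * (1 + k * gamma_plus). The constant k is
    convention dependent; k = 2 is an assumption of this sketch.
    """
    if q_unlensed <= 0:
        raise ValueError("q_unlensed must be positive")
    return (q_obs / q_unlensed - 1.0) / k


# Hypothetical galaxy: unlensed axis ratio 0.60 and a true shear of 1 per cent.
gamma_true = 0.01
q_unlensed = 0.60
q_obs = q_unlensed * (1.0 + 2.0 * gamma_true)

gamma_hat = estimate_gamma_plus(q_obs, q_unlensed)
print(f"recovered gamma_+ = {gamma_hat:.4f}")
```

In practice q_unlensed would be replaced by a prediction, so any error in the predicted axis ratio propagates directly into the shear estimate.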

We note that in our work we will not be able to infer the unlensed position
angles. There will therefore be no reduction of shape noise for one of the two
shear components, $\gamma_\times$. This is likely to be a significant
limitation, as for example Whittaker et al. (2014) have shown that shear
estimators can be constructed using galaxy position angles only, and which
appear to contain most of the shear signal.


3 DATA 

We use the publicly available data from CFHTLenS in our analysis. CFHTLenS is a
154 square-degree multi-colour optical survey in ugriz incorporating all five
years' worth of data from the Wide, Deep and Pre-survey components of the CFHT
Legacy Survey. The CFHTLenS was optimised for weak lensing analysis with the
deep i-band data taken in optimal sub-arcsecond seeing conditions. For a
general overview of the survey see Erben et al. (2013) and Heymans et al.
(2012), as well as information about the photometry in Hildebrandt et al.
(2012).

The online datastore contains 107 photometrically derived parameters for each
of 8.05 million galaxies, ranging from the number of exposures, through galaxy
angular positions and photometric redshifts, to image ellipticity components
and point spread functions. The algorithm lensfit (Miller et al. 2007) was used
by Miller et al. (2013) to carry out the Bayesian estimation of the galaxy
shapes. The lensfit shapes require a multiplicative (Miller et al. 2012) and an
additive (Heymans et al. 2012) correction to be properly calibrated.
Corrections were made (specifically to the measurement e2) prior to the start
of our analysis (M. Simet, private communication). From the possible pool of
predictor variables, we concentrate on a particular subset of 16, which we list
in Table 1. The reader is referred to the CFHTLenS publications and catalogue
documentation (both listed above) for detailed definitions of these parameters
and explanations of how they were measured. Note that we do not a priori expect
two of these parameters (t_ml and t_b) to have any predictive power in
estimating galaxy ellipticity; in a sense, these are control variables added to
ensure proper performance of our analysis software/algorithm.

For computational efficiency, we select data spanning 100,000 contiguous rows
of the catalog, then exclude those whose measurements include the values 99 or
-99, those classified as stars with high probability (class_star > 0.95), and
those for which (fitclass | star_flag) ≠ 0. The final sample size is 89,990,
which is sufficiently large to probe the space spanned by the predictor
variables. We compare galaxy ellipticities that we predict from other
photometric parameters to the observed values from lensfit (i.e. ellip in
Table 1). The mean value of ellip in our 89,990-galaxy sample is
$\bar{e} = 0.3412$, while the root-mean-square (RMS) deviation from the mean is


$$ e_{\rm RMS} = \sqrt{\frac{1}{N_{\rm gal}} \sum_{i=1}^{N_{\rm gal}} (e_i - \bar{e})^2} = 0.1759. \tag{5} $$
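
The sample cuts and the RMS statistic described above can be sketched in a few lines of Python. The column names follow the text (class_star, fitclass, star_flag, ellip); the toy catalogue rows and the helper names `keep_row` and `mean_and_rms` are hypothetical and are used only to make the example self-contained.

```python
import math

SENTINELS = (99, -99)  # values marking missing measurements


def keep_row(row):
    """Apply the three cuts described in the text to one catalogue row (a dict)."""
    if any(value in SENTINELS for value in row.values()):
        return False                      # missing measurement
    if row["class_star"] > 0.95:
        return False                      # star with high probability
    if (row["fitclass"] | row["star_flag"]) != 0:
        return False                      # flagged by fitclass or star_flag
    return True


def mean_and_rms(values):
    """Mean and RMS deviation about the mean (normalised by N)."""
    n = len(values)
    mean = sum(values) / n
    rms = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, rms


# Made-up rows, for illustration only.
catalogue = [
    {"ellip": 0.20, "mag_r": 23.1, "class_star": 0.02, "fitclass": 0, "star_flag": 0},
    {"ellip": 0.50, "mag_r": 24.0, "class_star": 0.10, "fitclass": 0, "star_flag": 0},
    {"ellip": 0.35, "mag_r": 99.0, "class_star": 0.05, "fitclass": 0, "star_flag": 0},
    {"ellip": 0.05, "mag_r": 21.5, "class_star": 0.98, "fitclass": 0, "star_flag": 0},
    {"ellip": 0.40, "mag_r": 22.7, "class_star": 0.03, "fitclass": 1, "star_flag": 0},
]

sample = [row for row in catalogue if keep_row(row)]
mean_e, rms_e = mean_and_rms([row["ellip"] for row in sample])
print(len(sample), mean_e, rms_e)
```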


4 ANALYSIS 


Our interest lies in predicting ellipticities as a function of the predictor
variables in Table 1. There are many fitting techniques available from the
realms of statistics and machine learning that may be applied to this problem;
we find that a combination of two regression techniques, generalized additive
models (GAM; see e.g. Chapter 7 of James et al. 2013) and projection pursuit
regression (PPR; Friedman & Stuetzle 1981), yields encouraging results. In
short, we use GAM to select a set of predictors from the pool of possibilities
in Table 1 without testing for interactions (which adds undue computational
complexity within the GAM framework), and then apply PPR, which works with
linear combinations of predictors, to generalize the GAM model.
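
To see why working with linear combinations of predictors can capture interactions that a purely additive model cannot, consider a toy response equal to the product of two predictors. It cannot be written as f1(x1) + f2(x2), but it is exactly the sum of two smooth functions of the projections along x1 + x2 and x1 - x2. The sketch below checks this identity; the predictors, the response and the function name `ridge_sum` are hypothetical and chosen only for illustration.

```python
import math


def ridge_sum(x1, x2):
    """Sum of two ridge functions f_k(alpha_k . x) that reproduces x1 * x2."""
    s = 1.0 / math.sqrt(2.0)
    alpha_1 = (s, s)     # unit vector along x1 + x2
    alpha_2 = (s, -s)    # unit vector along x1 - x2
    u1 = alpha_1[0] * x1 + alpha_1[1] * x2
    u2 = alpha_2[0] * x1 + alpha_2[1] * x2
    f1 = 0.5 * u1 ** 2   # first ridge function
    f2 = -0.5 * u2 ** 2  # second ridge function
    return f1 + f2


for a, b in [(0.3, 1.2), (-2.0, 0.5), (1.5, 1.5)]:
    print(a * b, ridge_sum(a, b))
```

This follows from x1 x2 = [(x1 + x2)^2 - (x1 - x2)^2] / 4, so two ridge terms suffice for a pure two-way interaction.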

The GAM model is

$$ e = \beta_0 + \sum_{j=1}^{p} \beta_j f_j(x_j), \tag{6} $$

where $p$ is the number of predictors and $f_j(x_j)$ is a nonparametrically
smoothed version of the predictor $x_j$. (In our analysis, we use the gam.fit
function of the R package mgcv, and apply smoothing splines to each predictor
individually.)
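
The idea of an additive model can be illustrated with a minimal backfitting loop in Python. This is a sketch under simplifying assumptions, not the mgcv fitting procedure: each smooth function is approximated by bin averages rather than by smoothing splines, the coefficients are absorbed into the smooth functions, and the data are synthetic. All function names are hypothetical.

```python
import math
import random


def binned_smoother(x, r, n_bins=10):
    """Step function approximating E[r | x] by averaging r in equal-width bins of x."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0
    sums = [0.0] * n_bins
    counts = [0] * n_bins

    def bin_index(value):
        return min(max(int((value - lo) / width), 0), n_bins - 1)

    for xi, ri in zip(x, r):
        sums[bin_index(xi)] += ri
        counts[bin_index(xi)] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda value: means[bin_index(value)]


def centred(f, shift):
    """Subtract a constant so the smooth term has zero mean over the sample."""
    return lambda value: f(value) - shift


def fit_additive(columns, y, n_iter=10):
    """Backfitting: cycle over predictors, smoothing the partial residuals of each."""
    n = len(y)
    beta0 = sum(y) / n
    funcs = [lambda value: 0.0 for _ in columns]
    for _ in range(n_iter):
        for j, xj in enumerate(columns):
            partial = [
                y[i] - beta0 - sum(funcs[k](columns[k][i])
                                   for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            fj = binned_smoother(xj, partial)
            funcs[j] = centred(fj, sum(fj(v) for v in xj) / n)
    return beta0, funcs


def predict(beta0, funcs, row):
    return beta0 + sum(f(v) for f, v in zip(funcs, row))


# Synthetic data with a purely additive structure.
random.seed(0)
x1 = [random.uniform(-3, 3) for _ in range(500)]
x2 = [random.uniform(-1, 1) for _ in range(500)]
y = [math.sin(a) + b ** 2 + random.gauss(0, 0.05) for a, b in zip(x1, x2)]

beta0, funcs = fit_additive([x1, x2], y)
mse_const = sum((yi - beta0) ** 2 for yi in y) / len(y)
mse_gam = sum((yi - predict(beta0, funcs, (a, b))) ** 2
              for yi, a, b in zip(y, x1, x2)) / len(y)
print(mse_const, mse_gam)
```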

To avoid overfitting, we apply a forward-stepwise search, wherein we add each
predictor from the pool individually to the baseline model, and see which
achieves the greatest reduction in mean squared error (MSE):


$$ {\rm MSE} = \frac{\sum_i w_i (e_i - \hat{e}_i)^2}{\sum_i w_i}, \tag{7} $$


where $e_i$ and $\hat{e}_i$ are the measured and predicted ellipticities
respectively of galaxy $i$ and $w_i$ is that galaxy’s weight as estimated by
lensfit (Miller et al. 2012).
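
A weighted mean squared error of this kind can be computed as in the following Python sketch. Normalising by the sum of the weights is an assumption of the sketch, and the function name `weighted_mse` and the numbers are hypothetical.

```python
def weighted_mse(measured, predicted, weights):
    """Weighted mean squared error, normalised by the sum of the weights."""
    if not (len(measured) == len(predicted) == len(weights)):
        raise ValueError("inputs must have the same length")
    numerator = sum(w * (e - e_hat) ** 2
                    for e, e_hat, w in zip(measured, predicted, weights))
    return numerator / sum(weights)


# Made-up ellipticities and weights, for illustration only.
measured = [0.20, 0.50, 0.30]
predicted = [0.25, 0.40, 0.30]
weights = [1.0, 2.0, 1.0]

mse = weighted_mse(measured, predicted, weights)
print(mse)
```

With equal weights the expression reduces to the ordinary mean squared error.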


To generate predictions $\hat{e}_i$, we apply five-fold cross-validation, i.e.
we randomly partition the data into five groups, and at any one time fit our
GAM model to four of them to generate predictions for the fifth group
(repeating the process until predictions are generated for all data). To
determine whether the reduction in MSE is statistically significant, we repeat
each fit of each predictor variable ten times (i.e. we randomly partition the
data into five groups ten separate times) to generate a distribution of MSEs.
Given the MSE distributions for the baseline and baseline-plus-new-predictor
models (which we assume are normal), we apply the two-sample t test to assess
the null hypothesis that the distributions have the same mean. If the t test
results in a p value < 0.05, we reject the null and incorporate the new
predictor into our baseline model. We then repeat the search over the remaining
predictors. Note that as part of this process we check to see if logarithmic or
exponential transformations of the predictors lead to larger reductions in MSE.
We show our results in Figure 1: the final GAM model, which includes 13
predictors, reduces the MSE from 0.03424 (the value for a constant model) to
0.01791.
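
The admission test for a single candidate predictor can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not taken from our analysis: the baseline is a constant model, the candidate model is a straight-line fit to one synthetic predictor, the MSE is unweighted, and the t test uses the pooled-variance form with a fixed critical value instead of a computed p value. All function names are hypothetical.

```python
import math
import random
import statistics


def kfold_predictions(x, y, fit, n_folds=5, rng=random):
    """Out-of-fold predictions: each point is predicted by a model fitted without it."""
    indices = list(range(len(y)))
    rng.shuffle(indices)
    predictions = [0.0] * len(y)
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in test:
            predictions[i] = model(x[i])
    return predictions


def cv_mse_samples(x, y, fit, n_repeats=10, rng=random):
    """One cross-validated MSE per independent random five-fold partition."""
    samples = []
    for _ in range(n_repeats):
        pred = kfold_predictions(x, y, fit, rng=rng)
        samples.append(sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y))
    return samples


def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic for the difference in means."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def fit_constant(x, y):
    mean = sum(y) / len(y)
    return lambda value: mean


def fit_line(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return lambda value: my + slope * (value - mx)


# Synthetic data: the response depends weakly on a single predictor.
rng = random.Random(1)
x = [rng.uniform(0, 1) for _ in range(300)]
y = [0.3 + 0.2 * xi + rng.gauss(0, 0.05) for xi in x]

mse_base = cv_mse_samples(x, y, fit_constant, rng=rng)
mse_new = cv_mse_samples(x, y, fit_line, rng=rng)
t_stat = two_sample_t(mse_base, mse_new)

# Two-sided 5 per cent critical value of Student's t for 10 + 10 - 2 = 18
# degrees of freedom; the predictor is admitted only if the MSE decreases.
admit_predictor = t_stat > 2.101
print(t_stat, admit_predictor)
```

In a full forward-stepwise search this test would be repeated for every remaining candidate, and the candidate giving the largest significant reduction would be added to the baseline.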

Given the set of predictor variables produced in the GAM step, we test for
interactions among them via PPR. The PPR model is

$$ e = \beta_0 + \sum_{k=1}^{M} \beta_k f_k(\alpha_k^{T} \mathbf{x}), \tag{8} $$

where $f_k(\alpha_k^{T} \mathbf{x})$ is the $k$th “ridge function” and where
the number of ridge functions $M$ is selected via the same MSE-reduction test
outlined above. In our analysis, we apply the base R function ppr, and we
choose the “supersmoother” function of Friedman (1984) as the smoothing
function $f_k(\cdot)$. The final MSE is 0.01622 for $M = 8$; this reduction in
ellipticity error relative to a constant model (MSE = 0.03424) is equivalent to
that achieved using the constant model and a dataset

$$ \frac{n_{\rm con}}{n_{\rm PPR}} = \frac{{\rm MSE}_{\rm con}}{{\rm MSE}_{\rm PPR}} = \frac{0.03424}{0.01622} \approx 2.111 \tag{9} $$

times larger, i.e. 111.1% larger.
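
The conversion from an MSE reduction to an equivalent sample size rests on the fact that the variance of an average over N galaxies scales as the per-galaxy variance divided by N. The short Python sketch below reproduces the arithmetic for the MSE values quoted above; the function name `equivalent_sample_gain` is hypothetical.

```python
def equivalent_sample_gain(mse_constant, mse_model):
    """Factor by which a constant-model sample must grow to match the model's precision."""
    return mse_constant / mse_model


gain = equivalent_sample_gain(0.03424, 0.01622)
percent_larger = 100.0 * (gain - 1.0)
print(f"{gain:.3f} times larger, i.e. {percent_larger:.1f}% larger")
```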

In Figure 3 we examine the effect of removing each of the predictors in turn on
the MSE of the best-fit PPR model. (Note that we do not attempt to optimize the
number of terms in each case but rather assume $M = 8$.) The predictors are
shown left-to-right in the order in which they were admitted to the
no-interaction GAM model. The effect on MSE of each parameter largely mimics
the t statistics associated with those parameters in Figure 1, with the
surprising exception that mag_z, the first predictor chosen in the GAM step,
can be excluded from the predictor pool made available in the PPR step with no
loss in predictive accuracy. Similarly, the last four predictors adopted in the
GAM step (mag_u, mag_r, z_ml, and z_b) can be dropped from the pool. This
serves to highlight the complexity of statistical model selection: the GAM
step, designed to reduce the number of possible predictors from ∼100 available
in general (or from the 16 that we pre-selected for this particular exercise)
to a more manageable pool, still ends up selecting more than is ultimately
necessary because it does not take predictor interactions into account.

Figure 1. Result of forward-stepwise model search using generalized additive
model regression. Values of the t statistic for the two-sample t test are shown
along the y-axis. (See the text for details on how we apply the two-sample t
test.) Predictors are admitted into the GAM model one at a time, in the order
shown from left to right. The p value for admitting t_ml is 0.067 and thus it
was not admitted to the final GAM model.

Figure 2. Measured ellipticity versus predicted ellipticity for the best-fit
PPR model with MSE = 0.01598 (i.e. $\sigma_e = 0.1264$). As detailed in the
text, use of the best-fit PPR model is equivalent to the application of a
constant model to a dataset with ≈ 110% more galaxies than the current one.

We thus perform the one additional step of removing the five predictors listed
above from the pool of predictors made available to the PPR model and
re-running the PPR analysis. We achieve an MSE of 0.01598 for $M = 8$, which is
equivalent to applying the constant model to a dataset that is 2.143 times, or
114.3%, larger. In Figure 2 we show the relationship between predicted and
measured ellipticity; the Pearson sample correlation coefficient between the
two is R = 0.667.
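
For reference, the Pearson sample correlation coefficient between measured and predicted values can be computed as in the Python sketch below. The function name `pearson_r` and the five pairs of values are hypothetical and do not come from the catalogue.

```python
import math


def pearson_r(a, b):
    """Pearson sample correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up measured and predicted ellipticities, for illustration only.
measured = [0.10, 0.25, 0.40, 0.55, 0.30]
predicted = [0.20, 0.28, 0.35, 0.45, 0.33]

r = pearson_r(measured, predicted)
print(r)
```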

To illustrate the difference between choosing from all predictors versus only
those not affected by lensing, we apply the PPR framework to only the set of
parameters area_world, mu_max, mag_(u,r,i,z), and z_(b,ml). We test various
combinations of these parameters. First, we test models with area_world and
mag_r and models that keep information on surface brightness only by combining
the two as mag_r/area_world. Second, we test models incorporating colours as
opposed to magnitudes. Regardless of model, the result is qualitatively
similar: the reductions in MSE relative to that of the constant model are
equivalent to using datasets that are 11.2%-12.5% larger, a far smaller
improvement than the 114.3% gained from examining all predictors.
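
Colours are unaffected by lensing because magnification is achromatic: it changes the magnitude in every band by the same amount, so magnitude differences cancel it. The Python sketch below builds such features from one catalogue row. The particular colour combinations, the function name `lensing_independent_features` and the numbers are assumptions chosen for illustration, not the specific combinations tested above; only the ratio mag_r/area_world is taken from the text.

```python
def lensing_independent_features(row):
    """Build colour and surface-brightness-like features from one catalogue row (a dict)."""
    return {
        "u_minus_r": row["mag_u"] - row["mag_r"],
        "r_minus_i": row["mag_r"] - row["mag_i"],
        "i_minus_z": row["mag_i"] - row["mag_z"],
        "mag_r_over_area": row["mag_r"] / row["area_world"],
        "mu_max": row["mu_max"],
    }


# Hypothetical catalogue row; the numbers are made up.
row = {"mag_u": 25.0, "mag_r": 23.5, "mag_i": 23.0, "mag_z": 22.75,
       "area_world": 2.0, "mu_max": 21.0}

features = lensing_independent_features(row)
print(features)
```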

Figure 3. Mean-squared error (MSE) resulting from the removal of each of the
named predictors in turn from the pool of predictors available to the PPR
model. The red dashed line indicates the MSE for the best-fit PPR model, and
the error bars are 1σ estimates based on 10 repetitions. This figure indicates
that by including linear combinations of predictors, several predictors that
were significant in the no-interaction GAM model (mag_z, mag_u, mag_r, z_ml,
z_b) can be excluded in the PPR model.

Table 1. Variables examined in our GAM-PPR framework. See e.g. Erben et al.
(2013), Table C1.

Variable           Description

Predictor Variables:
area_world         Galaxy area in world coordinates (= π × a_world × b_world;
                   the latter two quantities are estimated via SExtractor)
flux_radius        Galaxy half-light radius, estimated via SExtractor
fwhm_world         Galaxy FWHM assuming a Gaussian profile, estimated via
                   SExtractor
mag_[u,g,r,i,z]    Galaxy ugriz magnitudes, estimated via SExtractor
model_flux         Galaxy flux, estimated via lensfit
mu_max             Galaxy peak surface brightness, estimated via SExtractor
scalelength        Galaxy scalelength, estimated via lensfit
snratio            Galaxy signal-to-noise ratio, estimated via lensfit
t_[b,ml]           Spectral type, estimated via BPZ
z_[b,ml]           Galaxy peak-posterior/maximum likelihood photometric
                   redshifts, estimated via BPZ

Response Variable:
ellip              Galaxy ellipticity, estimated via lensfit (= √(e1² + e2²))
                   (e2 corrected by M. Simet, private communication)

Fit Weight:
weight             Galaxy weight in fitting, estimated via lensfit;
                   see Section 3.6 and Equation 8 of Miller et al. (2012)

5 SUMMARY AND DISCUSSION
5.1 Summary

We utilize a statistical framework based on generalized additive model (GAM)
regression and projection pursuit regression (PPR) to predict galaxy
ellipticities from other photometric parameters, and apply it to 89,990
galaxies taken from a value-added version of the public CFHTLenS catalog. Our
findings are as follows:

(i) Using a set of 13 parameters which include quantities which are affected by
lensing such as galaxy size and apparent magnitude, we find that the
ellipticity of individual galaxies can be predicted with an rms error
$\sigma_e = 0.1264$. This is 28.1% less than the rms standard deviation of
galaxy ellipticities about the mean. The gain in predictive accuracy relative
to a constant model is equivalent to utilizing a constant model with a dataset
114.3% larger than our 89,990-galaxy CFHTLenS-based dataset. This result
conclusively demonstrates that our statistical framework can reduce shape noise
in weak lensing measurements.

(ii) Using a reduced set of photometric parameters, those unaffected by lensing
(such as colour and surface brightness), we find that the ellipticity of
galaxies can be predicted with an rms error of $\sigma_e \approx 0.1749$, 0.5%
less than the rms standard deviation of galaxy ellipticities about the mean;
the gain in predictive accuracy relative to a constant model is equivalent to
utilizing a constant model with a dataset ≈ 12% larger.
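
The rms errors quoted in the two items above follow from the MSE values of Section 4 by taking square roots. The Python sketch below reproduces the arithmetic; the variable names are hypothetical, and the factor 1.12 is the approximate ≈ 12% gain, so the second result is only approximate.

```python
import math

mse_constant = 0.03424    # constant model
mse_ppr = 0.01598         # best-fit PPR model
rms_about_mean = 0.1759   # RMS deviation of ellip about its mean

# Item (i): rms error of the best-fit PPR model and its fractional reduction.
sigma_ppr = math.sqrt(mse_ppr)
reduction_percent = 100.0 * (1.0 - sigma_ppr / rms_about_mean)

# Item (ii): a gain of about 12 per cent corresponds to dividing the
# constant-model MSE by roughly 1.12.
sigma_unlensed_only = math.sqrt(mse_constant / 1.12)

print(f"{sigma_ppr:.4f} {reduction_percent:.1f}% {sigma_unlensed_only:.4f}")
```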

5.2 Discussion 

Although we have shown that photometric information can be used to predict
galaxy ellipticities, the scatter compared to the true values is still large,
so that on a galaxy by galaxy basis, photometric information alone is not a
viable competitor to other methods which use additional observables. For
example, Huff et al. (2013) have shown that spectroscopic information can in
principle reduce the effect of shape noise on both components of shear by an
order of magnitude, rendering it negligible, whereas we have only shown
reduction by a few tens of percent. On the other hand, the photometric
information will be present in catalogues without additional effort, so that
using it should at least be considered.

In our work the main distinction between parameters is whether they are
affected by lensing (e.g. size) or are unaffected (e.g. colour). A prediction
of ellipticities from the latter parameters has the advantage that the
predicted ellipticity should not be affected by lensing. There should therefore
be no correlation between the weak lensing shear that is eventually measured
after using the predicted ellipticity, and the predicted ellipticity itself.
This purity, as we have seen, does come at significant cost to the predictive
power, and so it becomes necessary to consider the more inclusive set of
parameters, which does not exclude those affected by lensing. In this case,
because one can regard our prediction of ellipticities as being to first order,
one might expect the effect of weak lensing on the parameters that enter into
the prediction to modify the resulting predicted ellipticities only at second
order. We therefore expect that the effect of lensing on the prediction should
be small. We defer the development of techniques to address this further to the
future.

In this paper, we have also left to future work the question of how best the
predicted ellipticity information can be incorporated into an estimator of the
weak lensing shear. When this is done, the fact that ellipticity predictions
from photometry only extend to galaxy shapes and not position angles, thus
restricting any benefits to only one component of the shear, should also be
taken into account. It is possible that the predictions are also better for
certain subsets of the data (e.g. bright galaxies) and this could also be
explored.

R. A. C. Croft et al.

One potential complication which could conceivably affect the reliability of
the techniques in this paper is that there may be environmental effects on the
relationship between photometric parameters and predicted ellipticities. This
would manifest itself as spatial clustering in the residuals of the
relationship, and could cause systematic errors in the inferred shear. The
magnitude of such effects could perhaps be gauged using measures of the
environment (e.g. nth nearest neighbour distance). Spatial correlations in
residuals from the Fundamental Plane (FP) relationship between photometric and
spectroscopic parameters of early-type galaxies have recently been detected
(Joachimi et al. 2015), showing that such effects are present in related data.
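
As an example of an environment measure of the kind mentioned above, the Python sketch below computes the distance from each galaxy to its nth nearest neighbour. It is a brute-force illustration that assumes flat-sky Cartesian positions in arbitrary units; the function name `nth_neighbour_distance` and the positions are hypothetical.

```python
import math


def nth_neighbour_distance(points, n=3):
    """Distance from each point to its n-th nearest neighbour (brute force, flat sky)."""
    if n < 1 or n > len(points) - 1:
        raise ValueError("n must be between 1 and len(points) - 1")
    distances = []
    for i, (xi, yi) in enumerate(points):
        others = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i
        )
        distances.append(others[n - 1])
    return distances


# Made-up positions, for illustration only.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
d2 = nth_neighbour_distance(positions, n=2)
print(d2)
```

Residuals of the ellipticity prediction could then be binned by this distance to look for a dependence on environment.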